========================================================

The wonderful world of white wines! The sugary drink that almost anybody could love. In this summary, I compare some crucial properties of white wines to figure out what makes them so good. So get ready, and let's dive in.

Univariate Plots Section

## [1] 4898   13
##     X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1   1           7.0             0.27        0.36           20.7     0.045
## 2   2           6.3             0.30        0.34            1.6     0.049
## 3   3           8.1             0.28        0.40            6.9     0.050
## 4   4           7.2             0.23        0.32            8.5     0.058
## 5   5           7.2             0.23        0.32            8.5     0.058
## 6   6           8.1             0.28        0.40            6.9     0.050
## 7   7           6.2             0.32        0.16            7.0     0.045
## 8   8           7.0             0.27        0.36           20.7     0.045
## 9   9           6.3             0.30        0.34            1.6     0.049
## 10 10           8.1             0.22        0.43            1.5     0.044
##    free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                   45                  170  1.0010 3.00      0.45     8.8
## 2                   14                  132  0.9940 3.30      0.49     9.5
## 3                   30                   97  0.9951 3.26      0.44    10.1
## 4                   47                  186  0.9956 3.19      0.40     9.9
## 5                   47                  186  0.9956 3.19      0.40     9.9
## 6                   30                   97  0.9951 3.26      0.44    10.1
## 7                   30                  136  0.9949 3.18      0.47     9.6
## 8                   45                  170  1.0010 3.00      0.45     8.8
## 9                   14                  132  0.9940 3.30      0.49     9.5
## 10                  28                  129  0.9938 3.22      0.45    11.0
##    quality
## 1        6
## 2        6
## 3        6
## 4        6
## 5        6
## 6        6
## 7        6
## 8        6
## 9        6
## 10       6
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Above, you will see the dimensions, first 10 lines, and a summary of each column. Looking at the above, you will notice that column X goes from 1 to 4898, which is how many observations we have. Because of this, I made the X column a factor, to label each individual wine.
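The report does not show the code behind this output. As a minimal sketch, assuming the data sit in a data frame (called `wines` here, a hypothetical name, and filled by hand with five columns from the first five rows printed above), the three views and the factor conversion could look like this in base R:

```r
# Five columns from the first five rows of the head() output above,
# typed in by hand for illustration (the real table has 4898 rows, 13 columns).
wines <- data.frame(
  X              = 1:5,
  fixed.acidity  = c(7.0, 6.3, 8.1, 7.2, 7.2),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality        = c(6, 6, 6, 6, 6)
)

dim(wines)       # number of rows and columns
head(wines, 10)  # up to the first ten rows
summary(wines)   # five-number summary and mean for each numeric column

# X only numbers the rows, so treat it as a label instead of a measurement.
wines$X <- factor(wines$X)
```

After the conversion, `summary(wines)` would list level counts for `X` instead of a minimum, median and maximum.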

With this data laid out, I can see the values and columns but cannot really visualize any patterns. Next, I move on to some univariate plots so I can visualize the data and analyze it further.

Quality

In our dataset, the “quality” variable ranges between 3 and 9 with a median of 6, so there are neither very bad nor truly excellent wines, mostly average ones. Also, only 25 wines are rated either 3 or 9.
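These figures can be cross-checked against the group sizes (the `n` column) printed per quality level in the Plot One output later in this report. The sketch below rebuilds a quality vector from those counts; the variable names are only illustrative.

```r
# Number of wines per quality level, copied from the "n" column
# of the per-quality statistics in the Plot One output.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Expand the counts into one quality score per wine.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

length(quality)   # total number of wines
table(quality)    # frequency of each rating
median(quality)   # middle rating
mean(quality)     # average rating

# Wines at the two ends of the observed range.
extreme <- sum(quality %in% c(3, 9))
extreme
```

The total, median and mean obtained this way agree with the dimensions and the `quality` column of the summary table above.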

Fixed Acidity

The basic histogram shows that fixed acidity has very few values at the low end (the minimum is 3.8) and a long tail beyond 10 (the maximum is 14.2), so I limit the x axis range. Changing the bin width also shows more clearly that the majority of fixed acidity values fall between 5.5 and 8.5.
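The plotting code is not part of this report. As a rough base R sketch of the same two adjustments (restricting the range and choosing a bin width), using only the ten fixed acidity values visible in the first rows above:

```r
# First ten fixed acidity values, copied from the head() output above.
fixed_acidity <- c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0, 6.3, 8.1)

# Keep only the range of interest, mirroring the restricted x axis.
in_range <- fixed_acidity[fixed_acidity >= 5.5 & fixed_acidity <= 8.5]

# The spacing of the break sequence is the bin width (0.5 here).
# plot = FALSE returns the bin counts without drawing; drop it to see the chart.
bins <- hist(in_range, breaks = seq(5.5, 8.5, by = 0.5), plot = FALSE)

data.frame(from  = head(bins$breaks, -1),
           to    = bins$breaks[-1],
           count = bins$counts)
```

The 0.5 bin width is an assumption for illustration; on the full data a narrower width would probably be more informative.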

Volatile Acidity

After adjusting the bin width, I can see that most wines have a volatile acidity (acetic acid) between 0.15 and 0.4 g/l, with a median of 0.26 g/l and a mean of 0.28 g/l.

Citric Acid

The majority of citric acid levels fall between 0.15 and 0.5 g/l, with a spike at 0.49 g/l. In contrast to volatile acidity, citric acid adds freshness to the wine.

Chlorides

Most wines have between 0.025 and 0.06 g/l of sodium chloride, with a median of 0.043 g/l. The highest level in this dataset is 0.346 g/l.

Free Sulfur Dioxide

The median value of free sulfur dioxide is 34 mg/l. It has a wide range, from 2 to 289 mg/l, with the majority of values falling between 10 and 55 mg/l. Since free sulfur dioxide becomes noticeable at 50 mg/l, I expect it to affect the taste of some of these wines.

Total Sulfur Dioxide

Similar to free sulfur dioxide, total sulfur dioxide also has a wide range from 9 to 440 mg/l with a median value at 134 mg/l.

Density

Density falls in a very narrow range: most values lie between 0.985 and 1.005, although the full range runs from 0.9871 to 1.0390.

pH

The pH is between 2.72 and 3.82, which means that the wines are on the acidic side.

Alcohol

An appropriate level of alcohol enhances the flavor, but a high level causes an unpleasant burning sensation. Our white wine dataset doesn't appear to contain very high alcohol levels: the median is 10.4% and the majority of values fall between 9% and 13%.

Residual Sugar

Residual sugar has a wide range, from 0.6 to 65.8 g/l, while the median is only 5.2 g/l. This is probably because wine producers try to cater to consumers' varying preferences for sweetness. Some people, like me, favor sweet wines, while others prefer bone dry.

Summary of Residual Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Summary of Residual Sugar Log10

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2218  0.2304  0.7160  0.6432  0.9956  1.8180

Summary of Residual Sugar Square Root

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7746  1.3040  2.2800  2.3200  3.1460  8.1120
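The three summaries can be related with a little arithmetic. Because `log10()` and `sqrt()` are increasing functions, order-based statistics such as the minimum, quartiles, median and maximum carry over (up to interpolation between neighbouring values), while the mean does not. The sketch below applies both transformations to the figures from the first summary; the values in the third summary match the square root of the originals.

```r
# Quartile figures copied from "Summary of Residual Sugar" above.
sugar <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)
sugar_mean <- 6.391

log_sugar  <- log10(sugar)  # compare with the Log10 summary
sqrt_sugar <- sqrt(sugar)   # compare with the third summary

round(log_sugar, 4)
round(sqrt_sugar, 4)

# The mean is not order-based: transforming the mean is not the same
# as taking the mean of the transformed values (0.6432 in the Log10 summary).
log10(sugar_mean)
```

The log scale compresses the long right tail, pulling the maximum of 65.8 g/l down to about 1.8.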

Transforming Data

Univariate Analysis

The structure of the dataset:

There are 4,898 different white wines in the data, with 11 features that may affect the quality of the wine. In this study, I look at the factors that interest me most to find out whether they have any effect on quality. Quality has been converted to a factor with levels from 3 to 9, with 9 being the highest quality.
One observation so far is that the residual sugar is quite a bit lower than I expected for white wines: half of the wines have 5.2 g/l or less, and three quarters have 9.9 g/l or less. Also, pH appears to follow an approximately normal distribution.

What is/are the main feature(s) of interest?

The main feature of this dataset is the quality of the wine. All of the data revolves around the quality, with certain areas of interest such as the pH, residual sugar, and the acidity.

Bivariate Plots Section

## (7.99,9.24] (9.24,10.5] (10.5,11.7]   (11.7,13]   (13,14.2] 
##         845        1730        1390         795         138
## 
##  Descriptive statistics by group 
## group: (7.99,9.24]
##    vars   n mean  sd median trimmed  mad min  max range skew kurtosis se
## X1    1 845 0.28 0.1   0.27    0.27 0.07 0.1 0.82  0.71 1.47      3.2  0
## -------------------------------------------------------- 
## group: (9.24,10.5]
##    vars    n mean  sd median trimmed  mad  min max range skew kurtosis se
## X1    1 1730 0.28 0.1   0.26    0.27 0.09 0.08   1  0.92  1.7     5.87  0
## -------------------------------------------------------- 
## group: (10.5,11.7]
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 1390 0.26 0.09   0.24    0.25 0.07 0.09 0.96  0.88 1.79     6.76
##    se
## X1  0
## -------------------------------------------------------- 
## group: (11.7,13]
##    vars   n mean  sd median trimmed  mad  min max range skew kurtosis se
## X1    1 795  0.3 0.1   0.29    0.29 0.07 0.08 1.1  1.02 1.45     6.37  0
## -------------------------------------------------------- 
## group: (13,14.2]
##    vars   n mean   sd median trimmed mad  min  max range skew kurtosis
## X1    1 138 0.37 0.12   0.35    0.36 0.1 0.15 0.78  0.64 0.87     1.18
##      se
## X1 0.01
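The interval labels in the output above have the shape that base R's `cut()` produces when asked for five equal-width bins over the alcohol range of 8 to 14.2; this is an inference from the labels, not something the report states. The sketch below uses only alcohol and volatile acidity values that appear in this report, and swaps in `tapply()` for the per-group table, so it returns only medians and not the full set of statistics.

```r
# Alcohol minimum (8.0) and maximum (14.2) from the summary, plus rows
# 1-4, 7 and 10 of the head() output. The volatile acidity of the two
# extreme wines is not shown in the report, so it is left as NA.
wines <- data.frame(
  alcohol          = c(8.0, 8.8, 9.5, 10.1, 9.9, 9.6, 11.0, 14.2),
  volatile.acidity = c(NA, 0.27, 0.30, 0.28, 0.23, 0.32, 0.22, NA)
)

# breaks = 5 asks for five equal-width intervals spanning the data range.
wines$alcohol.bin <- cut(wines$alcohol, breaks = 5)

levels(wines$alcohol.bin)  # interval labels
table(wines$alcohol.bin)   # wines per interval

# Median volatile acidity within each alcohol interval.
medians <- tapply(wines$volatile.acidity, wines$alcohol.bin,
                  median, na.rm = TRUE)
medians
```

With so few rows, the medians are not meaningful, and intervals without known values come back as `NA`; the point is only the recoding of a continuous variable into groups. Passing `breaks = 3` would give three wider intervals.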

Bivariate Analysis

Relationships:

There are a lot of interesting relationships in the data, such as alcohol vs. quality. It is very interesting that the quality of the wine tends to increase as the amount of alcohol increases. There are also interesting relationships between residual sugar and density, and between volatile acidity and alcohol.

Let's dig deeper into these relationships and explore some multivariate analysis to see where this takes us in the wine exploration.

Multivariate Plots Section

## (7.99,10.1] (10.1,12.1] (12.1,14.2] 
##        2086        2154         658

When I break down quality by alcohol level and volatile acidity, the negative relationship between volatile acidity and quality is strongest in the 7.99-10.1% alcohol group. For example, within that group, the median volatile acidity decreases from 0.34 g/l for less desirable wines (quality = 4) to 0.19 g/l for highly rated ones (quality = 8); the former group also has a higher variation in volatile acidity (sd = 0.31) than the latter (sd = 0.03).


Final Plots and Summary

Plot One

## 
##  Descriptive statistics by group 
## group: 3
##    vars  n  mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 20 10.35 1.22  10.45   10.34 1.19   8 12.6   4.6 0.02    -0.83
##      se
## X1 0.27
## -------------------------------------------------------- 
## group: 4
##    vars   n  mean sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 163 10.15  1   10.1   10.08 1.04 8.4 13.5   5.1  0.7     0.15 0.08
## -------------------------------------------------------- 
## group: 5
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 1457 9.81 0.85    9.5    9.71 0.74   8 13.6   5.6 1.08     1.07
##      se
## X1 0.02
## -------------------------------------------------------- 
## group: 6
##    vars    n  mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 2198 10.58 1.15   10.5   10.52 1.33 8.5  14   5.5  0.4    -0.72
##      se
## X1 0.02
## -------------------------------------------------------- 
## group: 7
##    vars   n  mean   sd median trimmed  mad min  max range skew kurtosis
## X1    1 880 11.37 1.25   11.4   11.42 1.33 8.6 14.2   5.6 -0.3    -0.56
##      se
## X1 0.04
## -------------------------------------------------------- 
## group: 8
##    vars   n  mean   sd median trimmed  mad min max range  skew kurtosis
## X1    1 175 11.64 1.28     12   11.78 1.19 8.5  14   5.5 -0.89     0.01
##     se
## X1 0.1
## -------------------------------------------------------- 
## group: 9
##    vars n  mean   sd median trimmed mad  min  max range  skew kurtosis
## X1    1 5 12.18 1.01   12.5   12.18 0.3 10.4 12.9   2.5 -0.98    -1.03
##      se
## X1 0.45

Description One

Comparing wine quality with alcohol level is probably the highest priority for wine makers. There are very few high-quality wines; the majority sit around quality level 6. A higher quality comes with a higher mean alcohol level and also a lower density. Alcohol level has a relatively small range, from 8% to 14.2%, with a median of 10.4%. The majority of our wine ratings fall between 5 and 7. Except for the rating 3 and 4 categories, probably because of their relatively small sample sizes (20 and 163 wines), a better-rated wine has a higher alcohol level (the left chart).
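The trend can be read directly from the per-quality statistics printed for Plot One. As a small sketch (the column names are illustrative), the snippet below collects each group's size and mean alcohol, checks that their weighted average reproduces the overall mean from the summary table, and shows the change in mean alcohol from one rating to the next.

```r
# Group size and mean alcohol per quality level, copied from the
# "n" and "mean" columns of the Plot One output.
by_quality <- data.frame(
  quality      = 3:9,
  n            = c(20, 163, 1457, 2198, 880, 175, 5),
  mean_alcohol = c(10.35, 10.15, 9.81, 10.58, 11.37, 11.64, 12.18)
)

# Weighting each group mean by its size should give back the overall mean.
overall_mean <- weighted.mean(by_quality$mean_alcohol, by_quality$n)
overall_mean

# Change in mean alcohol between neighbouring ratings.
step_change <- diff(by_quality$mean_alcohol)
names(step_change) <- paste(by_quality$quality[-7], by_quality$quality[-1],
                            sep = "->")
step_change
```

The steps are negative from rating 3 to 4 and from 4 to 5, and positive for every step from 5 upward.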

Plot Two

## 
##  Descriptive statistics by group 
## group: (7.99,9.24]
##    vars   n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 845 41.51 15.79     42   41.34 16.31   5 128   123 0.32     0.93
##      se
## X1 0.54
## -------------------------------------------------------- 
## group: (9.24,10.5]
##    vars    n  mean    sd median trimmed   mad min   max range skew
## X1    1 1730 37.54 18.09     36   36.72 19.27   3 138.5 135.5  0.6
##    kurtosis   se
## X1     0.78 0.44
## -------------------------------------------------------- 
## group: (10.5,11.7]
##    vars    n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 1390 32.28 17.19     31   31.09 14.83   2 289   287 3.34    38.23
##      se
## X1 0.46
## -------------------------------------------------------- 
## group: (11.7,13]
##    vars   n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 795 30.44 12.74     30   29.96 11.86   3  96    93 0.77     2.48
##      se
## X1 0.45
## -------------------------------------------------------- 
## group: (13,14.2]
##    vars   n  mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 138 27.88 12.14     28   27.35 13.34   3  65    62  0.4    -0.07
##      se
## X1 1.03
## 
##  Descriptive statistics by group 
## group: (7.99,9.24]
##    vars   n mean  sd median trimmed  mad min  max range skew kurtosis se
## X1    1 845 0.28 0.1   0.27    0.27 0.07 0.1 0.82  0.71 1.47      3.2  0
## -------------------------------------------------------- 
## group: (9.24,10.5]
##    vars    n mean  sd median trimmed  mad  min max range skew kurtosis se
## X1    1 1730 0.28 0.1   0.26    0.27 0.09 0.08   1  0.92  1.7     5.87  0
## -------------------------------------------------------- 
## group: (10.5,11.7]
##    vars    n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 1390 0.26 0.09   0.24    0.25 0.07 0.09 0.96  0.88 1.79     6.76
##    se
## X1  0
## -------------------------------------------------------- 
## group: (11.7,13]
##    vars   n mean  sd median trimmed  mad  min max range skew kurtosis se
## X1    1 795  0.3 0.1   0.29    0.29 0.07 0.08 1.1  1.02 1.45     6.37  0
## -------------------------------------------------------- 
## group: (13,14.2]
##    vars   n mean   sd median trimmed mad  min  max range skew kurtosis
## X1    1 138 0.37 0.12   0.35    0.36 0.1 0.15 0.78  0.64 0.87     1.18
##      se
## X1 0.01

Description Two

The two combined charts below plot alcohol against free sulfur dioxide and volatile acidity. The higher the alcohol level, the lower the free sulfur dioxide. For example, the median free sulfur dioxide in the 13-14.2% alcohol group is only 28 mg/l, much less than the 42 mg/l in the 7.99-9.24% alcohol group. The relationship between volatile acidity and alcohol becomes clearer among the higher alcohol groups. For instance, the median volatile acidity in the 13-14.2% alcohol group is 0.35 g/l, compared with 0.24 g/l in the 10.5-11.7% alcohol group.

Plot Three

Description Three

This chart is very interesting and descriptive in many ways. It shows the log10 of residual sugar, which forms a roughly bimodal histogram. It also factors in the quality of the wine as the color and builds the histogram on that. It seems that higher residual sugar means lower quality in most circumstances.

Reflection

This white wine dataset is the tidiest one I've ever used. However, I was frustrated in the beginning because, except for alcohol, almost none of the other input variables has a strong relationship with wine quality. Reading a correlation matrix is not enough. When conditioning on other relevant variables, the relationships between the physicochemical properties and quality became clearer. Also, all input variables are continuous, which limited the types of graphs I could make. One solution was to recode them as categorical variables.

The other problem I had was that my knowledge of the physicochemical properties and how they interact was limited before starting this project. I had to resort to additional reading to brush up on my wine knowledge.

This dataset is pretty limited, with 13 variables (technically 12 can be used for analysis because one of them is an ID variable). It would be great if other variables, such as grape type and wine age, could be included for further investigation.